Fix launcher job scheduling directives when unsuspending#772
Fix launcher job scheduling directives when unsuspending#772GonzaloSaez wants to merge 1 commit intokubeflow:masterfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@GonzaloSaez could you sign DCO? |
@GonzaloSaez Could you keep the current mechanism (creating a batch/v1 Job even when the MPIJob is suspended)? |
You can follow https://github.com/kubeflow/mpi-operator/pull/772/checks?check_run_id=63645778871 steps to sign DCO. |
Signed-off-by: GonzaloSaez <11050889+GonzaloSaez@users.noreply.github.com>
880261d to
fe8d324
Compare
tenzen-y
left a comment
There was a problem hiding this comment.
@GonzaloSaez Thank you for working on this problem.
Basically, LGTM.
Additionally, could you add an integration test case to https://github.com/kubeflow/mpi-operator/blob/master/test/integration/mpi_job_controller_test.go?
| // so we must clear it first via a status sub-resource update (consistent with JobSet). | ||
| if launcher.Status.StartTime != nil { | ||
| launcher.Status.StartTime = nil | ||
| if _, err := c.kubeClient.BatchV1().Jobs(namespace).UpdateStatus(context.TODO(), launcher, metav1.UpdateOptions{}); err != nil { |
There was a problem hiding this comment.
| if _, err := c.kubeClient.BatchV1().Jobs(namespace).UpdateStatus(context.TODO(), launcher, metav1.UpdateOptions{}); err != nil { | |
| var err error | |
| if launcher, err = c.kubeClient.BatchV1().Jobs(namespace).UpdateStatus(context.TODO(), launcher, metav1.UpdateOptions{}); err != nil { |
Could you update launcher after startTime update to avoid coflict while scheduling directive update?
| // syncLauncherSchedulingDirectives updates the mutable scheduling directives (as per KEP-2926) on | ||
| // the launcher Job's pod template to match the desired template. | ||
| func syncLauncherSchedulingDirectives(launcher *batchv1.Job, desired *corev1.PodTemplateSpec) { | ||
| if launcher.Spec.Template.Labels == nil { |
There was a problem hiding this comment.
| if launcher.Spec.Template.Labels == nil { | |
| if desired.Labels != nil && launcher.Spec.Template.Labels == nil { |
Optimizing initialization would be better.
| // the launcher Job's pod template to match the desired template. | ||
| func syncLauncherSchedulingDirectives(launcher *batchv1.Job, desired *corev1.PodTemplateSpec) { | ||
| if launcher.Spec.Template.Labels == nil { | ||
| launcher.Spec.Template.Labels = make(map[string]string) |
There was a problem hiding this comment.
| launcher.Spec.Template.Labels = make(map[string]string) | |
| launcher.Spec.Template.Labels = make(map[string]string, len(desired.Labels)) |
| if desired.Annotations != nil { | ||
| if launcher.Spec.Template.Annotations == nil { | ||
| launcher.Spec.Template.Annotations = make(map[string]string) | ||
| } | ||
| for k, v := range desired.Annotations { | ||
| launcher.Spec.Template.Annotations[k] = v | ||
| } | ||
| } |
There was a problem hiding this comment.
| if desired.Annotations != nil { | |
| if launcher.Spec.Template.Annotations == nil { | |
| launcher.Spec.Template.Annotations = make(map[string]string) | |
| } | |
| for k, v := range desired.Annotations { | |
| launcher.Spec.Template.Annotations[k] = v | |
| } | |
| } | |
| if desired.Annotations != nil && launcher.Spec.Template.Annotations == nil { | |
| launcher.Spec.Template.Annotations = make(map[string]string) | |
| } | |
| for k, v := range desired.Annotations { | |
| launcher.Spec.Template.Annotations[k] = v | |
| } |
The range loop will be executed only when the desired.Annotaions are not null.
This should address #770.
If an MPIJob is suspended and then unsuspended (i.e. like Kueue would do during workload creation or when preemption occurs), the launcher job would not have the correct scheduling directives after launch job unsuspension. We need to perform the same operations as JobSet does: https://github.com/kubernetes-sigs/jobset/blob/f1bbaaef64b2a56c4721843b1d83750d21227948/pkg/controllers/jobset_controller.go#L537